This document explains how various visualisation concepts were used to derive insights for airline operators based on a dataset of flight delays and cancellations. This dataset is available on kaggle and originates from the US DOT flight statistics web page, which is official and open source. This base data contains 293k data points and 21 features, and will subsequently be augmented with more recent data and additional information on airports. A brief description of the data fields in the base dataset is shown below:
year
month
carrier : Abbreviation of carrier
carrier_name : the actual carrier name
airport : Abbreviation of airbort
airport_name : the actual airport name
arr_flights: Number of flights arrived the airport.
arr_del15 : Number of flights delayed.
carrier_ct: Number of flights delayed due to air carrier
weather_ct: Number of flights delayed due to weather.
nas_ct: Number of flights delayed due to National Aviation System ( non-extreme weather conditions, airport operations, heavy traffic volume, and air traffic control ) check more in here%3A,volume%2C%20and%20air%20traffic%20control.)
security_ct: Number of flights delayed due to security
late_aircraft_ct: Number of flights delayed due to a previous flight.
arr_cancelled: Number of flight that has been cancelled.
arr_diverted: Number of flight that has been diverted.
arr_delay: time of delayed flights.
carrier_delay: time of delayed flights due to air carrier.
weather_delay: time of delayed flights due to weather.
nas_delay:time of delayed flights due to National Aviation System.
security_delay:time of delayed flights due to security.
late_aircraft_delay:time of delayed flights due to a previous flight.
For the target audience of airline operators, we identify three hypothesis and explain how they are useful to airline operators:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
The first hypothesis is that the overall reliability of US airlines has improved over time and is expected to continue improving in the future. In this analysis, we will use the flight punctuality rate and average delay time to represent reliability. To investigate this hypothesis, we first group the data by year, calculate the overall punctuality rate and average delay time of each year, and plot the data on a graph. On the same graph, we perform linear regressions on both the punctuality rate data and average delay time data to observe the trends. In the second graph, we break down the delay rate and average delay time into different delay factors to analyze how each factor has changed individually. In the third graph, we create a correlation matrix of delayed factors to investigate the correlation between them.
# Load the data
data = pd.read_csv('airline delay causes.csv')
data.head()
| year | month | carrier | carrier_name | airport | airport_name | arr_flights | arr_del15 | carrier_ct | weather_ct | ... | security_ct | late_aircraft_ct | arr_cancelled | arr_diverted | arr_delay | carrier_delay | weather_delay | nas_delay | security_delay | late_aircraft_delay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009 | 11 | AS | Alaska Airlines Inc. | BRW | Barrow, AK: Wiley Post/Will Rogers Memorial | 65.0 | 14.0 | 3.61 | 1.06 | ... | 0.0 | 6.19 | 0.0 | 1.0 | 920.0 | 592.0 | 34.0 | 78.0 | 0.0 | 216.0 |
| 1 | 2009 | 11 | AS | Alaska Airlines Inc. | BUR | Burbank, CA: Bob Hope | 85.0 | 5.0 | 1.00 | 0.00 | ... | 0.0 | 0.98 | 0.0 | 0.0 | 134.0 | 25.0 | 0.0 | 70.0 | 0.0 | 39.0 |
| 2 | 2009 | 11 | AS | Alaska Airlines Inc. | CDV | Cordova, AK: Merle K Mudhole Smith | 58.0 | 8.0 | 2.04 | 1.00 | ... | 0.0 | 2.96 | 4.0 | 2.0 | 586.0 | 174.0 | 37.0 | 62.0 | 0.0 | 313.0 |
| 3 | 2009 | 11 | AS | Alaska Airlines Inc. | DCA | Washington, DC: Ronald Reagan Washington National | 88.0 | 11.0 | 1.99 | 0.00 | ... | 0.0 | 0.00 | 0.0 | 0.0 | 266.0 | 65.0 | 0.0 | 201.0 | 0.0 | 0.0 |
| 4 | 2009 | 11 | AS | Alaska Airlines Inc. | DEN | Denver, CO: Denver International | 123.0 | 5.0 | 3.47 | 0.00 | ... | 0.0 | 0.00 | 0.0 | 0.0 | 81.0 | 57.0 | 0.0 | 24.0 | 0.0 | 0.0 |
5 rows × 21 columns
# Filter the data to only include years 2003-2019 (exclude 2020)
data = data[data['year'].between(2003, 2019)]
# Group the data by year and calculate the overall punctuality rate and average delay time
grouped_data = data.groupby('year')[['arr_flights', 'arr_del15', 'arr_cancelled', 'arr_delay']].sum().reset_index()
grouped_data['punctuality_rate'] = (grouped_data['arr_flights'] - grouped_data['arr_del15'] - grouped_data['arr_cancelled']) / grouped_data['arr_flights'] * 100
grouped_data['avg_delay_time'] = grouped_data['arr_delay'] / grouped_data['arr_del15']
# Perform linear regression on the punctuality rate data
x1 = grouped_data['year']
y1 = grouped_data['punctuality_rate']
coeffs1 = np.polyfit(x1, y1, 1)
m1 = coeffs1[0]
b1 = coeffs1[1]
# Perform linear regression on the average delay time data
x2 = grouped_data['year']
y2 = grouped_data['avg_delay_time']
coeffs2 = np.polyfit(x2, y2, 1)
m2 = coeffs2[0]
b2 = coeffs2[1]
# Create the line chart
fig, ax1= plt.subplots()
color1 = 'tab:red'
ax1.set_xlabel('Year')
ax1.set_ylabel('Punctuality Rate (%)', color=color1)
ax1.plot(x1, y1, color=color1, label='Punctuality Rate')
ax1.tick_params(axis='y', labelcolor=color1)
ax1.plot(x1, m1*x1+b1, color=color1, linestyle='--', label='Punctuality Rate Regression Line')
plt.legend(loc='upper left')
ax2 = ax1.twinx() # instantiate a second axes that shares the same x-axis
color2 = 'tab:blue'
ax2.set_ylabel('Average Delay Time (minutes)', color=color2)
ax2.plot(x2, y2, color=color2, label='Average Delay Time in dealyed flight')
ax2.tick_params(axis='y', labelcolor=color2)
ax2.plot(x2, m2*x2+b2, color=color2, linestyle='--', label='Average Delay Time Regression Line')
plt.legend(loc='upper right')
# Customize the layout of the chart
fig.tight_layout()
plt.title('Overall Punctuality Rate and Average Delay Time Over Time')
plt.show();
# Create a figure with two subplots
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
# Calculate the percentage of delayed flights due to different reasons for each year
delayed_by_carrier = data.groupby('year')['carrier_ct'].sum() / data.groupby('year')['arr_flights'].sum() * 100
delayed_by_weather = data.groupby('year')['weather_ct'].sum() / data.groupby('year')['arr_flights'].sum() * 100
delayed_by_nas = data.groupby('year')['nas_ct'].sum() / data.groupby('year')['arr_flights'].sum() * 100
delayed_by_security = data.groupby('year')['security_ct'].sum() / data.groupby('year')['arr_flights'].sum() * 100
delayed_by_late_aircraft = data.groupby('year')['late_aircraft_ct'].sum() / data.groupby('year')['arr_flights'].sum() * 100
# Create a single area chart for all delay factors in the first subplot
ax[0].stackplot(delayed_by_carrier.index, delayed_by_carrier, delayed_by_weather, delayed_by_nas, delayed_by_security, delayed_by_late_aircraft, labels=['Air Carrier', 'Weather', 'NAS', 'Security', 'Late Aircraft'])
# Customize the layout of the first chart
ax[0].set_title('Percentage of Delayed Flights Due to Different Reasons Over Time')
ax[0].set_xlabel('Year')
ax[0].set_ylabel('Percentage of Delayed Flights')
ax[0].legend()
# Calculate the average delayed time due to different reasons for each year
delayed_by_carrier = data.groupby('year')['carrier_delay'].sum() / data.groupby('year')['carrier_ct'].sum()
delayed_by_weather = data.groupby('year')['weather_delay'].sum() / data.groupby('year')['weather_ct'].sum()
delayed_by_nas = data.groupby('year')['nas_delay'].sum() / data.groupby('year')['nas_ct'].sum()
delayed_by_security = data.groupby('year')['security_delay'].sum() / data.groupby('year')['security_ct'].sum()
delayed_by_late_aircraft = data.groupby('year')['late_aircraft_delay'].sum() / data.groupby('year')['late_aircraft_ct'].sum()
# Create a line chart with a separate trace for each delay factor in the second subplot
ax[1].plot(delayed_by_carrier.index, delayed_by_carrier, label='Air Carrier')
ax[1].plot(delayed_by_weather.index, delayed_by_weather, label='Weather')
ax[1].plot(delayed_by_nas.index, delayed_by_nas, label='NAS')
ax[1].plot(delayed_by_security.index, delayed_by_security, label='Security')
ax[1].plot(delayed_by_late_aircraft.index, delayed_by_late_aircraft, label='Late Aircraft')
# Customize the layout of the second chart
ax[1].set_title('Average delay time due to different reasons over time')
ax[1].set_xlabel('Year')
ax[1].set_ylabel('Average delay time (minutes)')
ax[1].legend()
# Show the figure
plt.show()
delay_factors = ['carrier_delay', 'weather_delay', 'nas_delay', 'security_delay', 'late_aircraft_delay']
delay_data = data[delay_factors]
correlations = delay_data.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlations, cmap='coolwarm', annot=True, fmt='.2f')
plt.title('correlation matrix between flight delay factors')
plt.show()
The second hypothesis is major airlines were less affected by flight cancellations during the COVID-19 pandemic, which drastically altered the air travel landscape. As air travel gradually recovers to pre-pandemic levels, airline operator's might find it useful to understand how their operations were affected by the pandemic so they can implement effective recovery measures. After all, domestic travel accounts for almost 70% of the revenue for some of these companies. In the following section, we therefore visualise how domestic flights were affected during this period.
"""
In this section, we import the input data from a CSV file and perform some simple preprocessing.
"""
#Setting up Notebook
import numpy as np
import pandas as pd
#Reading Input Files
flightData = pd.read_csv('Airline_Delay_Cause_New.csv')
cancelData = flightData[["year", "month", "carrier_name", "airport", "arr_flights", "arr_cancelled"]] #retain relevant columns
cancelData.dropna()
cancelData.head()
| year | month | carrier_name | airport | arr_flights | arr_cancelled | |
|---|---|---|---|---|---|---|
| 0 | 2022 | 11 | Endeavor Air Inc. | ABY | 86.0 | 0.0 |
| 1 | 2022 | 11 | Endeavor Air Inc. | AEX | 59.0 | 0.0 |
| 2 | 2022 | 11 | Endeavor Air Inc. | AGS | 21.0 | 0.0 |
| 3 | 2022 | 11 | Endeavor Air Inc. | ALB | 98.0 | 3.0 |
| 4 | 2022 | 11 | Endeavor Air Inc. | ATL | 1143.0 | 3.0 |
"""
In this section, we compute and aggregage the flight cancellation statistics for each airline carrier by month. This code takes
1-2 minutes to run.
"""
from datetime import datetime
#Creating Dictionary to Store Cancellation Rate
cancelAirlineDict = pd.Series(index=cancelData.carrier_name).to_dict()
for key in cancelAirlineDict.keys():
cancelAirlineDict[key] = []
#Cleaning and Processing Input Data
prevCarrier = cancelData["carrier_name"][0]
prevMonth = cancelData["month"][0]
numFlights = 0
numCancel = 0
arrLen = 0
date = []
for i in cancelData.index:
#Checking that All Airlines have Entries for Each Month
if cancelData.iloc[i]["month"] != prevMonth:
arrLen += 1
for key,value in cancelAirlineDict.items():
if len(value) < arrLen:
cancelAirlineDict[key].append(0)
if len(value) > arrLen:
cancelAirlineDict[key] = cancelAirlineDict[key][:-1]
timestamp = str(cancelData.iloc[i]["year"]) + "/" + str(cancelData.iloc[i]["month"]) + "/01"
date.append(datetime.strptime(timestamp, '%Y/%m/%d'))
#Cumulatively adding Cancellation Rate Parameters
if cancelData.iloc[i]["carrier_name"] == prevCarrier:
numFlights += cancelData.iloc[i]["arr_flights"]
numCancel += cancelData.iloc[i]["arr_cancelled"]
#Calculating Cancellation Rate and Reset Boiler Plate
else:
if np.isnan(numFlights) or np.isnan(numCancel):
cancelAirlineDict[prevCarrier].append(0)
else:
cancelAirlineDict[prevCarrier].append(round(numCancel/numFlights, 3))
numFlights = cancelData.iloc[i]["arr_flights"]
numCancel = cancelData.iloc[i]["arr_cancelled"]
prevCarrier = cancelData.iloc[i]["carrier_name"]
prevMonth = cancelData.iloc[i]["month"]
#Preprocessing for Final Round
timestamp = str(cancelData.iloc[-1]["year"]) + "/" + str(cancelData.iloc[-1]["month"]) + "/01"
date.append(datetime.strptime(timestamp, '%Y/%m/%d'))
cancelAirlineDict[prevCarrier].append(numCancel/numFlights)
arrLen += 1
for key,value in cancelAirlineDict.items():
if len(value) < arrLen:
cancelAirlineDict[key].append(0)
if len(value) > arrLen:
cancelAirlineDict[key] = cancelAirlineDict[key][:-1]
#Calculating Average Cancellation Rates
cancelAirlineAv = []
for i in range(0,arrLen):
cumSum = 0
for key,value in cancelAirlineDict.items():
cumSum += value[i]
cancelAirlineAv.append(cumSum/len(cancelAirlineDict.keys()))
C:\Users\Tan Ning Xuan\AppData\Local\Temp\ipykernel_13292\1947007629.py:8: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning. cancelAirlineDict = pd.Series(index=cancelData.carrier_name).to_dict()
We are now ready to examine how COVID has affected flight cancellations We plot the cancellation rate for airlines over the years (rate is preferred over the nominal cancellation numbers to account for the different numbers of flight across carriers). We observe a sigificant spike in 2020, corresponding to the period where the impact of COVID was most disruptive. The plot also shows the flight cancellations were particularly bad for Alliegiant, Southwest, and American Airlines. In contrast, Frontier, Alaska, and United Airlines were found to be more reliabile. The cancellation rate was about 19% at its highest.
"""
In this section, we visualise the flight cancellation rates.
"""
from pandas import Timestamp
import plotly.express as px
from plotly.graph_objs import *
#Aggregating Results
cancelAirline = pd.DataFrame.from_dict(cancelAirlineDict)
cancelAirline['date'] = date
cancelAirline['Average'] = cancelAirlineAv
#Plotting Graphs
fig = px.line(cancelAirline, x="date", y=['Endeavor Air Inc.', 'American Airlines Inc.', 'Alaska Airlines Inc.',
'JetBlue Airways', 'Delta Air Lines Inc.', 'Frontier Airlines Inc.', 'Allegiant Air',
'Hawaiian Airlines Inc.', 'Envoy Air', 'Spirit Air Lines', 'PSA Airlines Inc.',
'SkyWest Airlines Inc.', 'Horizon Air', 'United Air Lines Inc.',
'Southwest Airlines Co.', 'Mesa Airlines Inc.', 'Republic Airline',
'ExpressJet Airlines LLC', 'ExpressJet Airlines Inc.', 'Virgin America', 'Average'],
labels = {"date": "Date", "value": "Cancellation Rate", "variable": "Airline"},
title= "Airline Cancellation Rates")
fig.update_layout(height=615, paper_bgcolor="rgba(0,0,0,0)", plot_bgcolor="rgba(192,192,192,0.2)")
fig.for_each_trace(
lambda trace: trace.update(line_color="lightgrey") if trace.name != "Average" else (),
)
fig.show()
From our analysis above, we identified three airlines that were particularly unreliable and another three that were particularly reliable. We conduct a further analysis on the actual number of delays and examine how each airline was affected by COVID.
"""
In this section, we extract the relevant columns from our input file to run the analysis on airline delay times.
"""
delayData = flightData[["year", "carrier_name", "arr_cancelled"]] #retain relevant columns
delayData.dropna()
delayData.head()
| year | carrier_name | arr_cancelled | |
|---|---|---|---|
| 0 | 2022 | Endeavor Air Inc. | 0.0 |
| 1 | 2022 | Endeavor Air Inc. | 0.0 |
| 2 | 2022 | Endeavor Air Inc. | 0.0 |
| 3 | 2022 | Endeavor Air Inc. | 3.0 |
| 4 | 2022 | Endeavor Air Inc. | 3.0 |
"""
In this section, we aggregate the number of cancellations of the six selected airlines: Allegiant, Southwest, American,
Frontier, Alaska, and United Airlines. This code takes about 1 minute to run.
"""
#Creating Arrays to Store Cancellation Parameters
year = []
airline = []
cancel =[]
#Cleaning and Processing Input Data
for i in delayData.index:
if delayData.iloc[i]["carrier_name"] in ["Allegiant Air", "Southwest Airlines Co.", "American Airlines Inc.",
"Frontier Airlines Inc.", "Alaska Airlines Inc.", "United Air Lines Inc."]:
year.append(delayData.iloc[i]["year"])
airline.append(delayData.iloc[i]["carrier_name"])
cancel.append(delayData.iloc[i]["arr_cancelled"])
#Aggregating Results
delayAirline = pd.DataFrame(data={"Year": year, "Airline": airline, "Cancellations": cancel})
We observe that the year 2020 contained noticeable increases in both the average number of cancellations and variability in the number of cancellations. Note that we cannot simply compare the number of cancellations across airlines as these are nominal figures and each airlines flies a different number of routes. Compared to their usual reliability levels, Allegiant Airlines had noticeably more flight cancellations. Major airlines such as Southwest and American airlines have been better able to cope with the pandemic as the number of flight cancellations did not increase much in that period. The variablity in number of cancellations, however, did increase, suggesting that some airports were more affected by flight cancellations on some months. In contrast, the changes in cancellations for Allegiant Air are more pronounced in 2020. Therefore, by both measures of reliability (cancellation rate, reliability compared to past years), Alliegiant Airline performed more poorly.
"""
In this section, we visualise the change in flight cancellation over the years for the selected airlines.
"""
from plotnine import ggplot, aes, facet_wrap, scales, geom_boxplot, labs
(
ggplot(delayAirline, aes('Year', 'Cancellations', color='Airline'))
+ geom_boxplot(aes(x='factor(Year)', y='Cancellations'))
+ scales.scale_y_log10()
+ facet_wrap('~Airline')
+ labs(title='Cancellations per airport per month')
)
C:\Users\Tan Ning Xuan\anaconda3\lib\site-packages\pandas\core\arraylike.py:397: RuntimeWarning: divide by zero encountered in log10 C:\Users\Tan Ning Xuan\anaconda3\lib\site-packages\plotnine\layer.py:333: PlotnineWarning: stat_boxplot : Removed 15156 rows containing non-finite values.
<ggplot: (132280935927)>
We can go one step further to help one of these airlines with poorer reliability improve. From the data, we can identify which destinations were more impacted by cancellations so the airlines know where to best channel their resources to minimise flight cancellations. In the following section, we pick Allegiant Air as the example for our analysis.
"""
In this section, we extract the relevant columns from our input file to run the reliability analysis at the route level.
"""
#Reading Input File
airportData = flightData[["year", "carrier_name", "airport", "arr_cancelled"]] #retain relevant columns
airportData.dropna()
airportData.head()
#Reading Additional Input File
#coordinatesData = pd.read_csv('Coordinates.csv')
| year | carrier_name | airport | arr_flights | arr_cancelled | |
|---|---|---|---|---|---|
| 0 | 2022 | Endeavor Air Inc. | ABY | 86.0 | 0.0 |
| 1 | 2022 | Endeavor Air Inc. | AEX | 59.0 | 0.0 |
| 2 | 2022 | Endeavor Air Inc. | AGS | 21.0 | 0.0 |
| 3 | 2022 | Endeavor Air Inc. | ALB | 98.0 | 3.0 |
| 4 | 2022 | Endeavor Air Inc. | ATL | 1143.0 | 3.0 |
"""
In this section, we aggregate the cancellation numbers for each destination of Allegiant Air for the years of 2018-2019 and
2020 to examine the difference before and after the pandemic. This code takes about 1 minute to run.
"""
#Creating Dictionaries to Store Cancellation Rate Parameters
cancel1819 = pd.Series(index=airportData.airport).to_dict()
cancel20 = pd.Series(index=airportData.airport).to_dict()
difference = pd.Series(index=airportData.airport).to_dict()
for key in cancel1819.keys():
cancel1819[key] = 0
cancel20[key] = 0
#Cleaning and Processing Input Data
for i in airportData.index:
if airportData.iloc[i]["carrier_name"] == "Allegiant Air":
if airportData.iloc[i]["year"]==2018 or airportData.iloc[i]["year"]==2019:
cancel1819[airportData.iloc[i]["airport"]] += airportData.iloc[i]["arr_cancelled"]
elif airportData.iloc[i]["year"]==2020:
cancel20[airportData.iloc[i]["airport"]] += airportData.iloc[i]["arr_cancelled"]
#Aggregating Results
for key in difference.keys():
if cancel20[key]>0 and cancel1819[key]>0:
difference[key] = cancel20[key] - cancel1819[key]
airportDiff = pd.DataFrame(difference.items(), columns=['FAA Code', 'Difference'])
airportDiff.dropna(inplace=True)
C:\Users\Tan Ning Xuan\AppData\Local\Temp\ipykernel_13292\356042384.py:6: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning. C:\Users\Tan Ning Xuan\AppData\Local\Temp\ipykernel_13292\356042384.py:7: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning. C:\Users\Tan Ning Xuan\AppData\Local\Temp\ipykernel_13292\356042384.py:8: FutureWarning: The default dtype for empty Series will be 'object' instead of 'float64' in a future version. Specify a dtype explicitly to silence this warning.
"""
In this section, we import two new files: one containing the list of airport codes, and another containing their GPS
coordinates. We combine these data with our cancellation rate data from the preceding section. We then use an API developed
by the US FCC to obtain the FIPS codes for each airport. This data will subsequently be used to plot the choropleth. This code
takes about 2-3 minutes to run.
"""
import requests
#Defining Helper Function to Obtain FIPS Code
def latlong2fips (latitude, longitude):
url = "https://geo.fcc.gov/api/census/block/find?format=json&latitude=%f&longitude=%f" % (latitude, longitude)
r = requests.get(url)
return r.json()['County']['FIPS']
#Importing Additional Files
codeData = pd.read_csv('Airport_Codes.csv')
codeData.dropna()
GPSData = pd.read_csv('US_Airport_Database.csv')
GPSData.dropna()
#Combining Data
Fips = []
airportNames = []
for faa in airportDiff["FAA Code"]:
idx = codeData.index[codeData['FAA'] == faa][0] #find matching FAA code
airportNames.append(codeData.iloc[idx]["Airport"])
icao = codeData.iloc[idx]["ICAO"]
idx2 = GPSData.index[GPSData['ICAO'] == icao][0] #find matching ICAO code
lat,long = GPSData.iloc[idx2]["LATITUDE"], GPSData.iloc[idx2]["LONGITUDE"]
fip = latlong2fips(lat, long) #find matching coordinates
Fips.append(fip)
airportDiff["Fips"] = Fips
airportDiff["Name"] = airportNames
We now visualise this difference in cancellation rates on a choropleth map. From the data, we can readily identify areas which Allegiant Air has suffered more cancellations during the COVID period than usual: Harry Reid (NV) and Orlando Sanford (FL). They should therefore focus more of their supporting resources in these areas to recover more quickly from the pandemic.
"""
In this section, we visualise the difference in flight cancellations before and after COVID. This code takes about 1 minute
to run.
"""
from urllib.request import urlopen
import json
#Plotting Map
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
counties = json.load(response)
fig = px.choropleth_mapbox(airportDiff, geojson=counties, locations='Fips', color='Difference',
color_continuous_scale="Bluered", mapbox_style="carto-positron", zoom=3,
center={"lat":37.0902,"lon":-95.7129}, opacity=0.5, hover_name='Name')
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()
The third hypothesis is "Airlines in the airports of their headquarters should have a lower delay rate". To argue this point, we firstly select 9 ongoing operated airlines and corresponding 9 headquarter cities (may involve duplicates), computed average delay rate for these airlines in the cities, and produced ordered bar charts for comparison. After generating charts, we analyze any potential trend and dive into what and why the outlier airlines occur.
Airline_Delay_Cause.csv as variable airline_delayarr_del15) over the number of arrival flights (arr_flights)# read the first dataset
airline_delay = pd.read_csv("airline delay causes.csv")
# compute delay rate from existing columns
airline_delay["delay_prop"] = 100 * airline_delay["arr_del15"] / airline_delay["arr_flights"]
airline_delay.dropna(inplace = True)
airline_delay.head(n = 3)
| year | month | carrier | carrier_name | airport | airport_name | arr_flights | arr_del15 | carrier_ct | weather_ct | ... | late_aircraft_ct | arr_cancelled | arr_diverted | arr_delay | carrier_delay | weather_delay | nas_delay | security_delay | late_aircraft_delay | delay_prop | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009 | 11 | AS | Alaska Airlines Inc. | BRW | Barrow, AK: Wiley Post/Will Rogers Memorial | 65.0 | 14.0 | 3.61 | 1.06 | ... | 6.19 | 0.0 | 1.0 | 920.0 | 592.0 | 34.0 | 78.0 | 0.0 | 216.0 | 21.538462 |
| 1 | 2009 | 11 | AS | Alaska Airlines Inc. | BUR | Burbank, CA: Bob Hope | 85.0 | 5.0 | 1.00 | 0.00 | ... | 0.98 | 0.0 | 0.0 | 134.0 | 25.0 | 0.0 | 70.0 | 0.0 | 39.0 | 5.882353 |
| 2 | 2009 | 11 | AS | Alaska Airlines Inc. | CDV | Cordova, AK: Merle K Mudhole Smith | 58.0 | 8.0 | 2.04 | 1.00 | ... | 2.96 | 4.0 | 2.0 | 586.0 | 174.0 | 37.0 | 62.0 | 0.0 | 313.0 | 13.793103 |
3 rows × 22 columns
"""
We filter the data with only airports and airlines we selected
"""
# record the headquarter airport of each selected airline
headquarter = {
"UA" : "ORD",
"DL" : "ATL",
"F9" : "DEN",
"NK" : "MCO",
"MQ" : "DFW",
"YV" : "PHX",
"AA" : "DFW",
"AS" : "SEA",
"VX" : "SFO"
}
# filter the data with only these airports and airlines (carriers)
airports = list(headquarter.values())
carriers = list(headquarter.keys())
df = airline_delay.query("carrier in @carriers & airport in @airports").reset_index(drop = True)
"""
Compute the average delay rate of each airline in each airport. If an airline doesn't operate in an airport, create a new row
with delay_prop as 0. Round each rate with 2 decimal places and change zero rate as "no flight"
"""
# Compute the average delay rate of each airline in each airport
temp = df.groupby(["carrier","airport"])["delay_prop"].mean().reset_index()
for c in carriers:
if temp["carrier"].value_counts()[c] != len(carriers):
temp_df = temp.query("carrier == @c")
## append no flight rows
for a in airports:
if a not in list(temp_df["airport"]):
temp = temp.append({
"carrier" : c,
"airport" : a,
"delay_prop" : 0
}, ignore_index = True)
# round to 2 decimal places and change zero rate to "no flight"
temp["labels"] = temp["delay_prop"].apply(lambda x : str(round(x, 2)) + "%" if x != 0 else "no flight")
"""
set the color code for each airline because our target audience is the airline officers, different color codes can highlight
the headquarter more obvious and easily found draw bar chart as subplot for each carrier and change the title and ticks properly
"""
# color code for each airline (searched on Google)
color_map = {
"DL" : "#E3132C",
"UA" : "#005daa",
"F9" : "#248168",
"NK" : "#FFD700",
"AS" : "#00385F",
"YV" : "#2B1B17",
"AA" : "#086591",
"VX" : "#F70D1A",
"MQ" : "#1F45FC"
}
fig, ax = plt.subplots(3, 3, figsize = (22,16))
for i in range(0, 3):
for j in range(0, 3):
## extract the current carrier and corresponding data, but will not involve no flight data for the beauty of chart
c = carriers[3 * i + j]
temp_c = temp.query("carrier == @c").sort_values(by = "delay_prop")
temp_c = temp_c.query("labels != 'no flight'")
## draw a bar for each airport: if the airport is in the headquarter, use the specific color code, otherwise light grey
colors = []
y_axis = np.arange(temp_c.shape[0])
for airport in list(temp_c["airport"]):
if airport == headquarter[c]:
colors.append(color_map[c])
else:
colors.append("#E5E4E2")
## setup the bar label, tick label, and bar chart title
plot = ax[i,j].barh(width = tuple(temp_c["delay_prop"]), y = y_axis, color = colors)
ax[i,j].bar_label(plot, labels = list(temp_c["labels"]), padding = 3)
ax[i,j].set_yticks(y_axis, tuple(temp_c["airport"]))
ax[i,j].set_xlim(0, 35)
ax[i,j].set_title("Carrier " + c, fontsize = 13, fontweight = "bold")
ax[i,j].set_xlabel("Avg Delay Rate (%)")
# setup the overall chart title
fig.suptitle("Average Delay Rate of Each Airline (Headquarter Highlighted)", fontsize = 25, fontweight = "bold")
plt.show()
From the chart above, we can see that 7 out 9 airlines have the least or relatively lower delay rate in their respective headquarters. This can be a convincing evidence to show that airlines in their headquarters tend to have a lower delay rate. But there are 2 outliers: Alaska Airline (AS) and Virgin America Airline (VX), which have a high delay rate in Seattle and San Francisco. We will focus on analyzing why it could happen and how to improve.
To analyze Alaska Airline in its headquarter Seattle, we compare the average proportion of each delay factor of AS in Seattle and across the US.
"""
Compute the average proportion for each factor, Alaska Airline only, in Seattle and the US overall
"""
tmp_y = ['carrier_ct', 'weather_ct', 'nas_ct', 'security_ct', 'late_aircraft_ct']
# across the US
as_factor = airline_delay.query("carrier == 'AS'")[tmp_y].mean()
as_factor_prop = 100 * as_factor / as_factor.sum()
# only Seattle
sea_factor = airline_delay.query("carrier == 'AS' and airport == 'SEA'")[tmp_y].mean()
sea_factor_prop = 100 * sea_factor / sea_factor.sum()
"""
Generate a bar chart for Seattle data and a dash line chart for the US overall
"""
# bar chart for Seattle data
plt.bar(sea_factor_prop.index, sea_factor_prop.values, alpha = 0.5, color = "#CB6D51")
# dash line chart for the US overall
plt.plot(as_factor_prop.index, as_factor_prop.values, "o--", color = "green")
# set up appropriate chart title, ticks, and labels
plt.title("Proportion of Delay Factors of Airline AS \n Across the US and in Airport SEA")
plt.xticks(np.arange(0,5), ["Carrier", "Weather", "NAS", "Security", "Late Aircraft"])
plt.ylabel("Proporation (%)")
plt.legend(["Across the US", "Only in SEA"], loc = "upper left")
plt.show()
From the chart above, we can observe that for Alaska Airline, the delay cause of NAS and late aircraft in Seattle is much high than across the US. The late aircraft is analyze in the first hypothesis, and the potential suggestion can be applied here that airlines should design the flight timetable not that dense to avoid this kind of delay. For the NAS part, we search some online news about the delays caused by the system and the airline's operation, and then provided insights based on these information.
The Virgin America (VX) airline is headquartered at San Francisco, but its average delay rate at San Francisco is not relatively lower than at other selected airports. To analyze this pheonomenon, we try to explore the problem from both the perspectives of SFO airport and VX airline.
"""
Average delay rate of each airline at SFO airport.
"""
# subset of dataset at airport SFO
sfo = airline_delay[airline_delay['airport'] == 'SFO']
# calculate the average delay rate by the group of each carrier, sort the new dataframe by value and reset the index
s = sfo.groupby('carrier')['delay_prop'].mean()
sorted_s = s.sort_values(ascending=False)
sorted_s = sorted_s.reset_index('carrier')
# set the length range of x axis and rename it by the carriers
x_range = range(0, len(s))
plt.xticks(x_range, sorted_s['carrier'])
# plot all the lollipops for each airline
plt.vlines(sorted_s['carrier'], ymin=0, ymax=sorted_s['delay_prop'], color='skyblue')
plt.scatter(sorted_s['carrier'], sorted_s['delay_prop'], color='skyblue', s=30)
# highligh the lollipop of airline VX in the chart
plt.vlines('VX', ymin=0, ymax=s.loc['VX'], color='orange')
plt.scatter('VX', s.loc['VX'], color='orange', s=30)
# rename the plot and display it
plt.title("Average delay rate of all airlines at SFO airport")
plt.show()
According to the lollipop chart above, it can be seen that the majority of airlines own a delay rate over 20% at San Francisco airport (SFO). Therefore, delays of flights at SFO tend to be a common problems across all airlines. Because of the FAA safety regulations and the limitations of runways in SFO, the airlines that arrive at SFO must land single-file with greater time intervals (SFO website, 2023).
To further validate this possible explanation, we try to compare the detailed delay factors of VX at SFO and at all airports in US.
"""
Delay factors of VX at SFO and across US.
"""
# only SFO
sfo_factor = airline_delay.query("carrier == 'VX' and airport == 'SFO'")[tmp_y].mean()
sfo_factor_prop = 100 * sfo_factor / sfo_factor.sum()
# bar chart for San Francisco data
plt.bar(sfo_factor_prop.index, sfo_factor_prop.values, alpha = 0.5, color = "#CB6D51")
# dash line chart for the US overall
plt.plot(as_factor_prop.index, as_factor_prop.values, "o--", color = "green")
# set up appropriate chart title, ticks, and labels
plt.title("Proportion of Delay Factors of Airline VX \n Across the US and in Airport SFO")
plt.xticks(np.arange(0,5), ["Carrier", "Weather", "NAS", "Security", "Late Aircraft"])
plt.ylabel("Proporation (%)")
plt.legend(["Across the US", "Only in SFO"], loc = "upper left")
plt.show()
According to the integrated graph of barchart and dash line, delays resulting from factor 'NAS' at SFO for VX is higher than its performance across US by about 15%, which is intuitively in accordance with the official explanation above.
In addition, VX airline deals with other factors better at SFO airport except for the landing limitation. This can still be a kind of corroboration of the hypothesis of this section.